Abstract

We present an algorithm, along with its implementation, to approximate single-qubit unitaries
using quantum circuits consisting of Clifford and T gates. In addition to meeting the known logarithmic
lower bounds on the number of gates required to approximate a unitary by a quantum circuit,
we give computational evidence that the approximations found for unitaries of the type
$\mathrm{diag}\{1, \exp(i\phi)\}$ are the best possible, or very close to the best possible. In particular, the quality
of our approximation is determined by the ability of the PARI software package to find the solution of a
certain type of Diophantine equation, and by the choice of the internal parameter Delta, which determines
the size of a computer search. We have furthermore structured our search to guarantee that our near-optimal
approximations can be found for reasonable error parameters (currently, down to $10^{-15}$), which allows
very long quantum circuits to be executed. We discuss how to improve our implementation further to handle
even smaller errors (by a few orders of magnitude), which would enable the high-precision synthesis of
even larger quantum algorithms.

1 Introduction 

The problem of single-qubit circuit synthesis is important for efficient quantum computation. A quantum
algorithm is most often described in terms of a high-level circuit or procedure whose elements could be
multiple-qubit transformations such as arithmetic operations (addition, multiplication, exponentiation)
or special-purpose transforms, such as the quantum Fourier transform (QFT). These large transforms are
then decomposed into high-level logical gates, such as Toffoli, Fredkin, SWAP, arbitrary two-qubit gates,
etc. Those gates are further broken down into CNOT and single-qubit gates [ ], and, finally, these gates
are broken down into, or approximated by, circuits of gates from the available computational gate set.
Motivated by the state-of-the-art methods for fault-tolerant quantum computation [1, 9, 15], we focus
on the computational gate set consisting of Clifford gates and the T gate (exact synthesis is not always
possible [10, 13]). The result is a logical circuit written using Clifford gates (for the purpose of this paper,
defined as the set of Pauli-X, Y, Z gates, Phase gate, Hadamard gate, and the CNOT) and T gates.



Furthermore, in the context of fault-tolerant implementations [7, 9], Clifford gates have a relatively low
implementation cost compared to the T gate [1, 8]. As a result, it is common to measure the circuit
cost in terms of the number of T gates required.

The problem of approximating an arbitrary single-qubit unitary by a quantum circuit has been well
studied in the literature. One of the first results providing a solution that runs in polynomial time and
is polylogarithmic in the desired error was the Solovay-Kitaev algorithm [ ]. To approximate a single-qubit
unitary to within error $\varepsilon$, it takes $O(\log^{2.71}(1/\varepsilon))$ steps on a classical computer, and the number
of gates in the resulting quantum circuit is $O(\log^{3.97}(1/\varepsilon))$. The best known upper bound on the circuit
size resulting from the application of the Solovay-Kitaev algorithm is $O(\log^{3+\delta}(1/\varepsilon))$, where $\delta$ can be
chosen arbitrarily small [12]. A number of approaches have been developed that use additional resources
in the form of ancillae, special states, or classical feedback, or whose application succeeds only
probabilistically in approximating a target unitary [11, 12]. These approaches improve over the resource
estimates of the Solovay-Kitaev algorithm, but fail to match the information-theoretic lower bound
of $\Omega(\log(1/\varepsilon))$ for synthesizing a random unitary with precision $\varepsilon$. Very recently, a new single-qubit
synthesis algorithm has been announced that uses at most two ancillae prepared in the state $|0\rangle$ and
guarantees a logarithmic number of gates in the resulting approximation [ ]; the algorithm also runs in
time polynomial in $\log(1/\varepsilon)$.

The circuits produced by [14] are asymptotically optimal. While this means that the asymptotics have
been settled up to a constant factor, those constant factors matter in actual implementations. [14]
reports three-qubit circuits approximating a single-qubit unitary, whereas in this paper we return to the
problem of rounding off to single-qubit unitaries (as proposed in [14], Future Work) and exactly
synthesizing single-qubit circuits for them based on [13]. Using only a single qubit is already an improvement
by a factor of three in terms of the total number of qubits required. Apart from using no ancillae,
the synthesis algorithm used in this paper and the one reported in [14] are similar in that they rely on
decreasing the power of the denominator of an element of the unitary over the ring $\mathbb{Z}[i, 1/\sqrt{2}]$ to zero as
a means of synthesizing the circuit. The algorithm reported in [14] uses three-qubit two-level unitaries
(which [10] shows how to synthesize in an asymptotically optimal way, through an elegant generalization
of the result in [13]). Each of these two-level unitaries may require a considerable number of T gates
to be implemented as a Clifford and T circuit [2, 10]. Moreover, the number of two-level gates required
to reduce the denominator by a factor of $\sqrt{2}$ is at least two, since there are three non-zero entries in
the matrix. We estimate that the number of T gates required to reduce the denominator by a factor of
$\sqrt{2}$ by the algorithm reported in [14] may be about 20-100, whereas the experimental results reported in
this paper show that the number of T gates required to reduce the denominator by a factor of $\sqrt{2}$ in our
circuits is only about 1.6. This results in a significant practical reduction of the resources
required. Furthermore, the circuits reported in this paper are generated with a computational guarantee
on their quality (defined by the parameter Delta); we therefore expect that it may be difficult to optimize
them further. We study how to approximate single-qubit unitaries with precisions of practical interest.
As a result, our priority is the practical performance of the software implementation and bringing down
the quantum resource requirements for approximations, namely the T-gate counts, which dominate the
cost of implementing single-qubit unitaries (all other resources are minimal; e.g., we use no ancillae).

The value $\varepsilon$ of the desired precision in approximating a single-qubit unitary affects what algorithm may
be used to obtain the approximating circuit. Indeed, if the desired error is on the order of 10~^, a brute 
force breadth first search approach may be used to compute the approximating circuit. However, breadth 
first search appears to run out of classical computational resources for error values below 10"**. On the 
other hand, one may not need to approximate single-qubit unitaries to an excessively small precision. 
Approximation to a higher precision than needed comes at the expense of the large number of gates 
needed to accomplish it, and as such is not desired. We have found, through experiments, that our software
can readily handle approximation errors of $10^{-15}$ without compromising the quality of the output as
defined by the parameter Delta. Given the desired overall logical error of about 0.1%, such an error per
$\mathrm{diag}\{1, \exp(i\phi)\}$ gate approximation allows execution of a quantum circuit containing as many as 10^'^
single-qubit logical gates requiring approximation. This calculation is approximate, as it assumes that 10
$R_z$ gates are used to approximate a single-qubit gate (in reality, it is at most 5), and, more importantly,
that the logical errors add up, which, for most applications and approximations, is unlikely to be the
case. Random and independent noise (such as that coming from approximations; our software can
be easily tuned to give a multitude of different approximations of about the same quality) scales as the
square root of the sum of absolute values of all errors. Therefore, with careful approximation and design
one might realistically hope to execute 10^® logical single-qubit gates with the overall error under 0.1%.

We further note that for a computation of this size, each of the 10^'^ single-qubit gates requires an
approximation by at least 150 T gates (Tables 1 and 2), each of which takes 50 units of physical resources
[ ] (two levels of state distillation seem reasonable for a computation of this size); therefore the total
resource count is at least 7.5 * 10"'^^. In classical terms, 7.5 * 10^^ may be regarded as, approximately, a
month-long computation on a 200,000 MIPS processor (comparable to the best modern processors at the
time of this writing). Given that quantum computers are expected to be controlled by classical computers,
the operation count for classical computers may be a reasonable upper bound on the possible number
of gates executable in quantum circuits.

The above discussion motivated our choice of compromise between the time spent on the calculations
and the quality of the output. In particular, we noted that we can manage errors of practical
importance, and have thus invested additional time into computing a better approximating unitary: our
results are accompanied by the computational guarantee that the circuits found are as small as it was
possible to obtain. As the lessons from classical compilers and Electronic Design Automation suggest,
when an algorithm may be compiled before its execution, well-optimized implementations
precomputed earlier, such as ours, result in much better practical designs.

The question of exact synthesis of quantum Clifford and T single-qubit circuits was studied in [ ].
Both papers report optimal implementations of the unitaries realized by the given circuits. In order
to find high-precision approximating circuits, they both rely on the Solovay-Kitaev algorithm. As a
result, the number of gates in the approximating circuits scales as $O(\log^{3.97}(1/\varepsilon))$. Importantly, [ ]
reports an algorithm that optimally decomposes a unitary over $\mathbb{Z}[i, 1/\sqrt{2}]$ into a quantum Clifford and T
circuit. We rely on this latter algorithm in our paper, as well as on the observation made in [ ] that
finding an approximating circuit is as difficult as finding the approximating unitary. Indeed, given the
approximating unitary (a unitary over the ring $\mathbb{Z}[i, 1/\sqrt{2}]$), the optimal circuit implementing it may be
found in a number of steps linear in the number of gates such a circuit contains (i.e., as fast as one
may have hoped).



by unitaries over the ring $\mathbb{Z}[i, 1/\sqrt{2}]$. We use the notation $\omega$ for the eighth root of unity, $e^{2\pi i/8}$. We note,
similarly to [ ], that any single-qubit unitary can be decomposed in terms of a constant number of
Hadamard gates and $R_z(\phi)$ gates ([ ], solution to Problem 8.1). Therefore, the ability to approximate $R_z(\phi)$
implies the ability to approximate any single-qubit unitary.
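
As a quick numerical sanity check of this reduction, the sketch below verifies the standard identity $H R_z(\theta) H = R_x(\theta)$, which, combined with the Euler decomposition $U = e^{i\delta} R_z(\alpha) R_x(\beta) R_z(\gamma)$, expresses a single-qubit unitary using two Hadamard gates and three $R_z$ rotations. The helper names (`matmul`, `rz`, `rx`, `euler_unitary`) and the convention $R_z(\phi) = \mathrm{diag}(e^{-i\phi/2}, e^{i\phi/2})$ are assumptions made for illustration; they are not taken from the authors' software.

```python
import cmath
import math

def matmul(a, b):
    """Multiply two 2x2 complex matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def rz(phi):
    """Assumed convention: R_z(phi) = diag(e^{-i phi/2}, e^{i phi/2})."""
    return [[cmath.exp(-0.5j * phi), 0], [0, cmath.exp(0.5j * phi)]]

def rx(theta):
    """Rotation about the x axis, cos(theta/2) I - i sin(theta/2) X."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -1j * s], [-1j * s, c]]

S = 1 / math.sqrt(2)
H = [[S, S], [S, -S]]  # Hadamard gate

def max_abs_diff(a, b):
    """Largest entrywise difference between two 2x2 matrices."""
    return max(abs(a[i][j] - b[i][j]) for i in range(2) for j in range(2))

def euler_unitary(alpha, beta, gamma):
    """R_z(alpha) H R_z(beta) H R_z(gamma): three R_z gates, two Hadamards."""
    out = rz(alpha)
    for gate in (H, rz(beta), H, rz(gamma)):
        out = matmul(out, gate)
    return out

theta = 0.7
conjugated = matmul(matmul(H, rz(theta)), H)
print(max_abs_diff(conjugated, rx(theta)))  # close to zero, up to rounding
```

Under these conventions, approximating each of the three $R_z$ factors separately is enough to approximate the whole product.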

It suffices to approximate the first column of the $2 \times 2$ matrix by a unit vector with entries in the ring
$\mathbb{Z}[i, 1/\sqrt{2}]$ to approximate the entire matrix by a quantum circuit [ ]. Any such vector can be written using
$x$ and $y$ from $\mathbb{Z}[\omega]$ as $\frac{1}{2^n}(x, y)$, where $x$ and $y$ satisfy the following condition:

$$|x|^2 + |y|^2 = 4^n. \quad (1)$$

We define the overall quality of approximation as:

$$\sqrt{\left|\frac{x}{2^n} - e^{-i\phi/2}\right|^2 + \left|\frac{y}{2^n}\right|^2},$$

which is proportional to the Frobenius distance between the matrix and its approximation.
There are two main steps in our algorithm: 

1. Determine $x$ based on $e^{-i\phi/2}$, aiming to minimize the approximation error,

2. Find y such that the condition (1) is satisfied. 

There are two major difficulties in this approach. Firstly, the expression determining the quality of 
approximation is relatively complicated. Secondly, the requirement to satisfy condition (1) results in a 
system of Diophantine equations that does not always have a solution. 

Now we concentrate on the solution to the first problem. We treat the power of 2 in the denominator
as the input parameter to our algorithm. It is not difficult to observe that we can rewrite the square of
the expression for the quality of approximation as follows:

$$\left|\frac{x}{2^n} - e^{-i\phi/2}\right|^2 + 1 - \frac{|x|^2}{4^n}.$$

As such, the problem is reduced to finding a suitable $x$.
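
For reference, the rewriting can be checked in two lines. The derivation below assumes that the approximating column is $(x, y)/2^n$ with $|x|^2 + |y|^2 = 4^n$ and that the target column is $(e^{-i\phi/2}, 0)$; the closed form in the last line is our own simplification, included only to make explicit that the quality depends on $x$ alone.

```latex
\begin{align*}
\left|\frac{x}{2^n} - e^{-i\phi/2}\right|^2 + \left|\frac{y}{2^n}\right|^2
  &= \left|\frac{x}{2^n} - e^{-i\phi/2}\right|^2 + 1 - \frac{|x|^2}{4^n}
     && \text{using } |y|^2 = 4^n - |x|^2 \\
  &= 2 - \frac{2\,\Re\!\left(x\, e^{i\phi/2}\right)}{2^n}
     && \text{expanding the first square.}
\end{align*}
```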



We find $x$ using a brute-force search procedure. As (1) does not always have a solution, we build a list of
approximations of $x$ sorted by the achieved accuracy of approximation. Then we try to solve (1) in the
order of decreasing quality of the approximation. Once we succeed, we output the unitary and produce
its decomposition into a circuit, relying on the exact synthesizer reported in [13].

In more detail, to build a list of approximations of $x$, we first build lists RE and IM of approximations
of $\Re(2^n e^{-i\phi/2})$ and $\Im(2^n e^{-i\phi/2})$, the real and imaginary parts, by numbers of the form $a + \sqrt{2}b$ with quality
of approximation at least $2^{-\mathrm{Delta}}$, limiting $|a| < 2^n$ and $|b| < 2^n$. Next we compute the quantities:

$$D_{\Re} = 4^n \cos^2(\phi/2) - |a + \sqrt{2}b|^2 + |2^n \cos(\phi/2) - (a + \sqrt{2}b)|^2,$$

$$D_{\Im} = 4^n \sin^2(\phi/2) - |c + \sqrt{2}d|^2 + |(c + \sqrt{2}d) + 2^n \sin(\phi/2)|^2,$$

where $a + \sqrt{2}b$ approximates $\Re(2^n e^{-i\phi/2})$ and $c + \sqrt{2}d$ approximates $\Im(2^n e^{-i\phi/2})$. Note that $D_{\Re}$ and
$D_{\Im}$ sum to the square of the overall quality of approximation that we are aiming to minimize,
multiplied by $4^n$. Having $D_{\Re}$ and $D_{\Im}$ precomputed allows us to efficiently build a list of elements with
overall quality of approximation below the selected threshold $\beta$. We sort the lists of $D_{\Re}$ and $D_{\Im}$. For each
element $\alpha$ of $D_{\Re}$ we look up all elements of $D_{\Im}$ in the range $[\alpha - \beta, \alpha + \beta]$ using binary search. We found
empirically that a threshold value of 1 results in a list with size approximately equal to the number of
elements in the lists RE and IM. If we do not find a solution of the Diophantine equation for the elements in
the list, we increase the threshold and consider elements with lower quality of approximation.
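
The search can be sketched for toy parameters as follows. This is a minimal floating-point illustration, not the authors' multithreaded implementation: the function names, the parameter `delta` (read here as the tolerance $2^{-\mathrm{delta}}$ on each coordinate), and the lookup window are all assumptions. In particular, the sketch keeps pairs whose sum $D_{\Re} + D_{\Im}$ lies in $[0, \beta]$, which is one reading of the threshold and may differ from the exact window used in the text; it also discards pairs with $|x|^2 > 4^n$, for which no $y$ can exist.

```python
import bisect
import math

SQRT2 = math.sqrt(2)

def coordinate_candidates(target, n, delta):
    """List (a, b, a + sqrt(2) b) with |a|, |b| < 2**n within 2**-delta of target."""
    tol = 2.0 ** -delta
    bound = 2 ** n
    found = []
    for b in range(-bound + 1, bound):
        rest = target - SQRT2 * b  # value that the integer a should match
        for a in range(math.floor(rest - tol), math.ceil(rest + tol) + 1):
            value = a + SQRT2 * b
            if abs(a) < bound and abs(target - value) <= tol:
                found.append((a, b, value))
    return found

def candidate_x(phi, n, delta, beta):
    """Return tuples (D, (a, b), (c, d)) sorted by D = D_re + D_im, with 0 <= D <= beta."""
    t_re = 2 ** n * math.cos(phi / 2)     # Re(2^n e^{-i phi/2})
    t_im = -(2 ** n) * math.sin(phi / 2)  # Im(2^n e^{-i phi/2})
    d_re = [(t_re ** 2 - v * v + (t_re - v) ** 2, a, b, v)
            for a, b, v in coordinate_candidates(t_re, n, delta)]
    d_im = sorted((t_im ** 2 - v * v + (v - t_im) ** 2, c, d, v)
                  for c, d, v in coordinate_candidates(t_im, n, delta))
    keys = [entry[0] for entry in d_im]
    out = []
    for alpha, a, b, re_val in d_re:
        # Binary search for all D_im with 0 <= alpha + D_im <= beta.
        lo = bisect.bisect_left(keys, -alpha)
        hi = bisect.bisect_right(keys, beta - alpha)
        for d_val, c, d, im_val in d_im[lo:hi]:
            # y can only exist if |x|^2 <= 4^n, so that |y|^2 = 4^n - |x|^2 >= 0.
            if re_val ** 2 + im_val ** 2 <= 4 ** n:
                out.append((alpha + d_val, (a, b), (c, d)))
    out.sort()
    return out

# Toy run for R_z(1/10) with a denominator of 2^4.
candidates = candidate_x(phi=0.1, n=4, delta=1, beta=8.0)
```

Each returned tuple describes a candidate $x = (a + \sqrt{2}b) + i(c + \sqrt{2}d)$; the candidates would then be passed, best first, to the norm-equation step. For the sizes reported in the experiments, exact arithmetic and a more careful enumeration would be needed.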

To find $y$ in the second step of the algorithm, we are already given $x$, which was obtained in the first
step. Equation (1) can be written as:

$$|y|^2 = C_1 + \sqrt{2}C_2. \quad (2)$$

We find $y$ via a reduction of the above equation to a well-studied problem in algebraic number theory.
We first introduce the required notation. The norm of an element $a + \sqrt{2}b$ of the ring $\mathbb{Z}[\sqrt{2}]$ is defined
as follows:

$$N_{\sqrt{2}}(a + \sqrt{2}b) = a^2 - 2b^2.$$

In other words, it is computed by multiplying $a + \sqrt{2}b$ by its adjoint, $a - \sqrt{2}b$. We also recall
the definition of the norm of an element $z$ in $\mathbb{Z}[\omega]$. Note that $|z|^2$ belongs to $\mathbb{Z}[\sqrt{2}]$. The norm itself is
defined as:

$$N(z) = N_{\sqrt{2}}(|z|^2).$$

It is important to remember that the norm is a multiplicative function of the ring elements.
The necessary condition that any solution of (1) must satisfy is:

$$N(y) = N_{\sqrt{2}}(|y|^2) = N_{\sqrt{2}}(C_1 + \sqrt{2}C_2).$$

The problem of finding a solution to the above equation is a known problem in algebraic number theory
[18]. Moreover, there exists an efficient algorithm that solves it for general extension rings. It is
included in the open-source number-theoretic library PARI [16], which we use as a part of our software
implementation for unitary approximations.
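
The two norms defined above are easy to compute with exact integer arithmetic. The sketch below is illustrative only: the tuple representation $(a, b, c, d)$ for $a + b\omega + c\omega^2 + d\omega^3$ and all function names are assumptions, and the norm equation itself is not solved here (the paper delegates that to PARI).

```python
def abs_squared(z):
    """|z|^2 for z = a + b w + c w^2 + d w^3 (w = e^{i pi/4}), as (A, B) = A + sqrt(2) B."""
    a, b, c, d = z
    return (a * a + b * b + c * c + d * d, a * b + b * c + c * d - d * a)

def norm_sqrt2(e):
    """Norm of A + sqrt(2) B in Z[sqrt(2)]: A^2 - 2 B^2."""
    A, B = e
    return A * A - 2 * B * B

def norm_omega(z):
    """Norm of z in Z[w], defined as the Z[sqrt(2)] norm of |z|^2."""
    return norm_sqrt2(abs_squared(z))

def mul_omega(z, w):
    """Product in Z[w], using w^4 = -1 to reduce powers."""
    a, b, c, d = z
    e, f, g, h = w
    return (a * e - b * h - c * g - d * f,
            a * f + b * e - c * h - d * g,
            a * g + b * f + c * e - d * h,
            a * h + b * g + c * f + d * e)

z1 = (1, 1, 0, 0)   # 1 + w, with |z1|^2 = 2 + sqrt(2)
z2 = (2, 0, 1, -1)  # 2 + w^2 - w^3
product = mul_omega(z1, z2)
print(norm_omega(z1), norm_omega(z2), norm_omega(product))
```

The last line illustrates the multiplicativity of the norm on one example: the norm of the product equals the product of the norms.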

We next describe how to find a solution to our problem given the solution to the norm equation. There
are two main issues here. The first is that the norm of a number is invariant with respect to
multiplication by units of the ring, that is, by elements with norm equal to 1. The second is that it is
invariant with respect to taking the adjoint:

$$N_{\sqrt{2}}(C_1 + \sqrt{2}C_2) = N_{\sqrt{2}}(C_1 - \sqrt{2}C_2).$$

Therefore, once we find $y$ that is a solution to the norm equation, we check if $|y|^2$ divides $C_1 + \sqrt{2}C_2$. If
it does not, we perform the following transformation:

$$y = a + b\omega + c\omega^2 + d\omega^3 \mapsto a - b\omega + c\omega^2 - d\omega^3,$$

which effectively changes $|y|^2$ to its adjoint in the ring $\mathbb{Z}[\sqrt{2}]$. In some rare cases $|y|^2$ divides
neither $C_1 + \sqrt{2}C_2$ nor $C_1 - \sqrt{2}C_2$. We have not investigated the reason for this, and simply skip such an
approximation of $x$ and try to solve the equation for a different approximation of $x$.
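
The divisibility check and the sign-flipping transformation can be sketched with exact integers. The names below (`divide_sqrt2`, `flip`, `align_with_rhs`) are hypothetical, and elements are represented as in the previous sketch: $(A, B)$ for $A + \sqrt{2}B$ and $(a, b, c, d)$ for $a + b\omega + c\omega^2 + d\omega^3$.

```python
def abs_squared(z):
    """|z|^2 for z = a + b w + c w^2 + d w^3, returned as (A, B) = A + sqrt(2) B."""
    a, b, c, d = z
    return (a * a + b * b + c * c + d * d, a * b + b * c + c * d - d * a)

def divide_sqrt2(num, den):
    """Return num / den in Z[sqrt(2)] as a pair, or None if den does not divide num."""
    c1, c2 = num
    A, B = den
    n = A * A - 2 * B * B          # den times its adjoint
    if n == 0:                     # only happens for den = 0
        return None
    p = c1 * A - 2 * c2 * B        # rational part of num * adjoint(den)
    q = c2 * A - c1 * B            # sqrt(2) part of num * adjoint(den)
    if p % n or q % n:
        return None
    return (p // n, q // n)

def flip(y):
    """a + b w + c w^2 + d w^3 -> a - b w + c w^2 - d w^3; maps |y|^2 to its adjoint."""
    a, b, c, d = y
    return (a, -b, c, -d)

def align_with_rhs(y, rhs):
    """Return y or flip(y), whichever has |.|^2 dividing rhs; None if neither does."""
    for cand in (y, flip(y)):
        if divide_sqrt2(rhs, abs_squared(cand)) is not None:
            return cand
    return None

# 17 = (5 + 2 sqrt(2)) (5 - 2 sqrt(2)); y = 2 + w has |y|^2 = 5 + 2 sqrt(2).
fixed = align_with_rhs((2, 1, 0, 0), (5, -2))
```

In this example $|y|^2 = 5 + 2\sqrt{2}$ does not divide $5 - 2\sqrt{2}$, so the transformed element $2 - \omega$ is returned instead. A `None` result corresponds to the rare case in which the candidate $x$ is skipped.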

Once we have found $y$ such that $|y|^2$ divides $C_1 + \sqrt{2}C_2$, we have expressed

$$u|y|^2 = C_1 + \sqrt{2}C_2,$$

for $u$ a unit in the ring $\mathbb{Z}[\sqrt{2}]$. It is a well-known fact that the units of $\mathbb{Z}[\sqrt{2}]$ are of the form
$(-1)^k(\sqrt{2} \pm 1)^m$. Therefore, we just need to find $k$ and $m$ corresponding to $u$. Given $k$ and $m$ we find
$\sqrt{u}$, if it is defined, and obtain the solution to the equation:

$$|\sqrt{u}\,y|^2 = C_1 + \sqrt{2}C_2.$$

We use a simple algorithm to find $k$ and $m$ in the expression for the unit $u$. We divide $u$ by the powers of
$\sqrt{2} \pm 1$ and check if the result, written in the form $A + \sqrt{2}B$, has coefficients $A$ and $B$ whose absolute
values are strictly less than the absolute values of the corresponding coefficients of $u$. This means
that $u$ included the corresponding power of the basic units. We repeat this process until we get 1 or $-1$.
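
A possible implementation of this reduction is sketched below. It writes every unit as $(-1)^k(\sqrt{2} + 1)^m$ with $m$ allowed to be negative, since $\sqrt{2} - 1 = (\sqrt{2} + 1)^{-1}$. Note one deliberate deviation from the description above: the sketch accepts a division step when $|A| + |B|$ strictly decreases, rather than requiring both coefficients to decrease, because the final step $1 + \sqrt{2} \to 1$ leaves $A$ unchanged. Function names are hypothetical.

```python
def decompose_unit(u):
    """Return (k, m) with u = (-1)**k * (sqrt(2) + 1)**m, for u = (A, B) = A + sqrt(2) B."""
    A, B = u
    if A * A - 2 * B * B not in (1, -1):
        raise ValueError("not a unit of Z[sqrt(2)]")
    m = 0
    while (A, B) not in ((1, 0), (-1, 0)):
        size = abs(A) + abs(B)
        down = (2 * B - A, A - B)  # u / (sqrt(2) + 1) = u * (sqrt(2) - 1)
        up = (A + 2 * B, A + B)    # u / (sqrt(2) - 1) = u * (sqrt(2) + 1)
        if abs(down[0]) + abs(down[1]) < size:
            (A, B), m = down, m + 1
        elif abs(up[0]) + abs(up[1]) < size:
            (A, B), m = up, m - 1
        else:
            raise ValueError("no reduction step applies")
    k = 0 if A == 1 else 1
    return k, m

def unit_from(k, m):
    """Rebuild (-1)**k * (sqrt(2) + 1)**m as a pair (A, B); used to check the result."""
    A, B = 1, 0
    for _ in range(abs(m)):
        A, B = (A + 2 * B, A + B) if m > 0 else (2 * B - A, A - B)
    return (-A, -B) if k % 2 else (A, B)

print(decompose_unit((3, 2)))    # 3 + 2 sqrt(2) = (sqrt(2) + 1)^2
print(decompose_unit((-7, -5)))  # -(7 + 5 sqrt(2)) = -(sqrt(2) + 1)^3
```

The loop terminates because $|A| + |B|$ is a positive integer that strictly decreases at every step.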

3 Experimental results 

We report experiments with $R_z(1/10)$, which we used to demonstrate the scaling of our approach.
Future revisions will contain an expanded set of results. For our experiments we used a high-performance
server with eight Quad-Core AMD Opteron 8356 (2.30 GHz) processors and 128 GB of RAM.
Our current implementation fully utilizes the processing power of the server and runs
32 threads in parallel.

Table 1 summarizes our results. The first column reports the power of 2 in the denominator of the
approximating unitary. The second column reports the value of the parameter Delta that we use to adjust the
size of the search space. The T-gate count shows the number of T gates required in the circuit implementing
the approximating unitary (synthesized using [ ]). We use the global phase invariant metric (Formula 1 in

Figure 1: The dependency of T-count on $\log_2(1/\varepsilon)$. With confidence level 0.999, the slope coefficient
belongs to the interval [3.04, 3.38] and the additive constant belongs to the interval [-13.34, -0.51].

Table 1: The results of approximating $R_z(1/10)$. For $n$, the power of 2 in the denominator, Delta was
chosen to be $\lfloor n/2 \rfloor$.



Power of      Delta     T gate   Precision   Approximation   PARI
denominator   (2^-m)    count                time (s)        time (s)
19            4         74       4.99E-08    2.3301          27.379
19            9         72       5.10E-08    0.2008          19.890
20            5         78       1.94E-08    2.4479          26.532
20            10        78       2.20E-08    0.1909          17.955
23            5         88       7.36E-10    22.615          5.7688
23            11        90       1.94E-09    0.8516          35.749
24            6         94       2.46E-10    28.278          3.8146
24            12        90       4.40E-10    0.9669          8.2583
25            6         100      1.60E-10    53.710          15.435
25            12        96       3.54E-10    1.7322          45.083
27            6         106      1.61E-11    211.62          4.2888
27            13        106      4.04E-11    4.1780          20.981
28            7         110      1.49E-11    242.06          51.418
28            14        112      2.74E-11    4.2127          40.875
30            7         116      8.53E-13    975.14          2.4162
30            15        116      4.49E-12    8.4449          36.773
31            7         120      7.49E-13    2000.3          27.550
31            15        122      7.59E-13    16.947          7.6613
32            8         124      1.06E-13    2359.0          1.4174
32            16        128      5.07E-13    18.060          16.718
33            11        124      1.06E-13    656.84          19.465
33            16        126      1.88E-13    35.986          19.828
34            11        130      2.19E-14    1481.0          2.4933
34            17        132      9.50E-14    40.933          20.558
35            11        136      6.72E-15    3111.2          1.5797
35            17        138      1.39E-14    79.948          4.2597

Table 2: Approximations of $R_z(1/10)$. Included are the cases when decreasing Delta resulted in better
approximations.

[7]) to measure the achieved precision. The following two columns report the runtime, in seconds, of the
most time-consuming parts of our computation: the time it took to build a list of approximations of
$e^{-i\phi/2}$, and the time it took for PARI to solve the norm equation.

Table 2 illustrates how increasing the size of the search space (decreasing the value of the parameter
Delta) influences the performance of our algorithm and the quality of the results. The table shows
only those cases when increasing the size of the search space resulted in a better approximation. In our
baseline experiment we used Delta equal to $\lfloor n/2 \rfloor$, where $n$ is the power of 2 in the denominator. We
also performed experiments with Delta equal to $\lfloor n/3 \rfloor$ and $\lfloor n/4 \rfloor$. With these choices of Delta we stopped
the experiments at $n = 35$ and $n = 32$, respectively, as the available memory became a limitation.
This is a limitation of the current implementation and can be addressed by a redesign of our software.

Table 3 shows the comparison of our results to those recently published in [17]. Our circuits feature a
T-count smaller by about 22-30%, which is important in practice.

Table 3: Comparison to the results reported in [17]. The precision is reported using Fowler's metric [7],
but the comparisons are essentially the same using other metrics.

4 Future work

Our immediate research plans include software optimization, with the goal of improving the balance
between computational speed and quality of approximation, as well as a tighter implementation of our
algorithms. Our algorithm is highly parallelizable, and it can benefit greatly from execution on
parallel hardware. We expect that the aggregate result of software optimizations and parallelization will
help decrease the error parameter for which we are able to search for approximations with the computational
quality guarantee by a few to, possibly, several orders of magnitude. We further plan to report extensive
benchmark results, most importantly, using Rz{^) single-qubit gates that are common in practical
applications. We employ an expensive algorithm for finding coordinate approximations. Faster
approximating algorithms are possible (e.g., based on continued fraction approximations), but those faster
algorithms may compromise the guarantee on the quality of the resulting approximating circuits. On
the other hand, it may be expected that those algorithms are able to handle much smaller errors.

5 Conclusion 

In this paper, we reported an algorithm for the construction of approximating circuits for single-qubit
unitaries. We targeted practical error sizes, and structured our search so as to provide a form of exhaustive
computational guarantee on the quality of the final result. The quality of circuits/approximations is
defined by the parameter Delta. The error sizes we can handle, down to $10^{-15}$, appear to be practical.
Furthermore, our future optimizations are expected to make it possible to scale further.